Detect and repair grounding-file drift after schema changes#72
Conversation
Adds elapsed_seconds wall-clock timing on every LLM API call, aggregates it through the agent loop, and surfaces it in the QUERY TOTAL log line and provenance JSON as `elapsed=X.XXs out_tps=Y.Y`. Provider-agnostic — works for Ollama (whose OpenAI-compatible endpoint omits its native timing fields), Anthropic, OpenAI, and GitHub Models alike. The reported rate is `output_tokens / elapsed_wallclock`, which includes prefill + network and is therefore a lower bound on the model's true decode rate; callers that want a clean decode rate should probe the model directly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two complementary pieces that catch the failure mode where
queries.yaml / schema_description.md / time_series.yaml fall out of
sync with the live database (typically after a `datasight tidy review`
reshape). When that happens the LLM agent is silently being trained on
wrong column names — hallucinating, refusing, or returning tables full
of zeros despite a clean execution.
Plan 1 — `datasight verify` static drift check
- New `datasight.grounding` library: AST-walks queries.yaml with
sqlglot, scans backticked identifiers in schema_description.md,
and validates time_series.yaml entries against the live schema.
- Suppresses false positives from prose enumeration values by
auto-loading distinct values of low-cardinality VARCHAR columns.
- Wired into `datasight verify` as a cheap pre-flight check. New
`--static-only` flag skips the LLM phase for fast drift-only
checks; `--skip-grounding-check` disables it.
Plan 2 — `datasight tidy review` post-apply repair hook
- New `datasight.grounding_repair` library: snapshots schema before
the tidy transform, calls the configured LLM with both old and
new schemas plus the drift report, validates every rewritten SQL
against the live DB (retries up to 2x on failures), and returns
proposed file contents without writing.
- Hook in `tidy review` runs after `apply_proposal` completes: shows
drift, prompts to repair, displays unified diff, prompts to apply,
and writes atomically via tempfile + os.replace.
- Scope limited to queries.yaml, schema_description.md, and
time_series.yaml — schema.yaml and measures.yaml stay owned by
tidy_review's existing update helpers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is ❌ Your patch check has failed because the patch coverage (71.35%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #72 +/- ##
==========================================
- Coverage 86.84% 85.86% -0.98%
==========================================
Files 61 64 +3
Lines 11562 12317 +755
==========================================
+ Hits 10041 10576 +535
- Misses 1521 1741 +220 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Adds an "Apple Silicon: MLX-native models" subsection to the choosing-an-llm concepts page with measured numbers from the resource monitor: - qwen2.5:7b (q4_K_M GGUF): ~2 GB resident — only option that fits 16 GB - gemma4:e2b-mlx-bf16: ~11 GB — NOT low-memory despite the "e2b" naming, because the default 256K KV-cache buffer dominates - qwen3.6:35b-a3b-coding-mxfp8: ~38 GB — best answer quality, sparse MoE works well on Apple Silicon's unified memory Tier table makes the recommendation explicit: 16 GB → qwen2.5:7b, 32 GB → qwen2.5:7b or gemma4 (tradeoff: terse vs. richer), 48 GB+ → qwen3.6:35b-A3B. Cross-references added from install.md and configuration.md so users landing on either page get pointed at the measured guidance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…est import - grounding.py: the _DEFAULT_GROUNDING_FILES tuple was a vestige of an earlier design pass where callers iterated over it; the final API takes each path as a keyword argument so the constant is dead code. - tests/test_grounding.py: pytest was imported but the tests use only assert statements (no raises/fixtures from pytest itself). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
Adds a static “grounding drift” detector plus an opt-in LLM-based repair flow to keep queries.yaml, schema_description.md, and time_series.yaml aligned with the live DB schema (especially after datasight tidy review reshapes). It also threads per-call wall-clock timing through the LLM stack to report token rates in logs and provenance.
Changes:
- Introduces
datasight.groundingfor static drift detection across grounding files and wires it intodatasight verify(--static-only,--skip-grounding-check). - Introduces
datasight.grounding_repairfor LLM-driven rewrite + validation + atomic write, and hooks it intodatasight tidy reviewpost-apply. - Adds client-side elapsed timing to LLM usage/cost reporting and updates docs with expanded local-model guidance.
Reviewed changes
Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_grounding.py | Unit tests for static grounding drift detection and report formatting. |
| tests/test_grounding_repair.py | Unit tests for JSON parsing, retry/validation behavior, diffs, and atomic writes in repair flow. |
| tests/test_cli_tools.py | Updates cost-data tests to include elapsed timing + token rate fields. |
| src/datasight/llm.py | Measures per-API-call wall-clock time and exposes it via Usage.elapsed_seconds. |
| src/datasight/agent.py | Aggregates elapsed call time across the agent loop into AgentResult.total_elapsed_seconds. |
| src/datasight/cost.py | Adds elapsed + token-rate fields to cost summaries and logs when available. |
| src/datasight/cli.py | Plumbs elapsed/token-rate fields into logging and provenance. |
| src/datasight/grounding.py | New static drift checker + schema/enum helpers + drift report formatter. |
| src/datasight/grounding_repair.py | New LLM-driven repair orchestration, validation, diffing, and atomic write helper. |
| src/datasight/cli_commands/verify.py | Runs static grounding drift check before LLM verification; adds new CLI flags. |
| src/datasight/cli_commands/tidy.py | Post-apply drift detection + optional LLM repair prompt and atomic apply. |
| docs/use/how-to/install.md | Updates Ollama model guidance and Apple Silicon notes. |
| docs/use/concepts/choosing-an-llm.md | Adds MLX-native Apple Silicon guidance and benchmark table. |
| docs/reference/configuration.md | Expands OLLAMA_MODEL guidance and links to MLX section. |
| docs/reference/cli.md | Documents new datasight verify flags and behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The CLI ``datasight tidy review`` already calls grounding repair after ``apply_proposal`` completes; the web equivalent at ``/api/tidy/apply`` did not. This wires the same flow into the web endpoint: 1. Snapshot the pre-apply schema from ``state.schema_info`` before the transform runs (so the LLM can see both old and new shapes). 2. After the apply + existing schema.yaml/measures.yaml updates and the schema re-introspection, build the post-apply truth set and run the drift check. 3. When drift exists and an LLM client is configured, call ``repair_grounding``, validate each rewritten SQL against the live DB, and atomically write any files that pass validation. Failures are surfaced in the response but never fail the apply itself — the database mutation has already committed. 4. Surface the outcome as ``grounding_repair`` in the JSON response so the UI can tell the user what changed (or why nothing did). The repair runs *before* the schema_text rebuild because ``_load_user_description`` reads schema_description.md from disk; rewriting first ensures the next LLM call sees the repaired grounding. Tests cover the three branches the response shape distinguishes: applied successfully, skipped (no LLM client configured), and surfaced validation errors. ``repair_grounding`` is monkeypatched in tests since the FastAPI startup hook re-initializes ``state.llm_client`` from the environment and would clobber an in-place stub. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The post-apply grounding repair was previously locked behind a
synchronous call inside `tidy review` apply (CLI) and
/api/tidy/apply (web). When the LLM call timed out — common with
large prompts on local models — the user had no way to retry without
re-running the database transform. Web requests also blocked on the
slow LLM call, which made the apply UX feel unrelated to the actual
DB work it had already committed.
This commit splits repair from apply on both surfaces and adds the
missing retry path:
- Persist the pre-apply schema to .datasight/grounding_snapshot.json
on every CLI tidy review and web apply. Both surfaces and the new
standalone CLI command read it.
- New `datasight grounding {check, repair}` group. `repair` reads
the snapshot, runs the LLM rewrite, shows a diff, and prompts to
apply. `--model` overrides the configured model for retry,
`--from-csv` derives the pre-tidy schema from a source CSV when
no snapshot exists, `--dry-run` skips the write.
- Web /api/tidy/apply now returns immediately after the DB mutation
and includes a static drift summary. The slow LLM repair moved to
/api/tidy/grounding/repair, called from a banner the drawer shows
after apply. The endpoint falls back to the on-disk snapshot when
the in-memory copy is gone (post-restart).
- New /api/grounding/status + header pill so users see drift any
time it exists, not only after they just applied something.
- `tidy review` accepts --model so the same flag works on every
LLM-using command.
Docs:
- Per-workload model recommendation in choosing-an-llm.md: tool-use
commands (tidy review) favor general qwen3.6, long-form generation
commands (grounding repair) favor the coding-mxfp8 variant. Drops
the previous single-observation hedge now that we have evidence
from both LLM-using paths.
- New "Repair grounding" section in curate-with-tidy-review.md.
- cli.md regenerated with the new grounding group.
- Top-level CLI help layout: orphans (config, session, tidy,
grounding) moved into existing or new groups so there is no
generic "Commands" bucket.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
EM101 in the two new "must not run" assertions added by the test split — assign the message to a variable first, matching the project-wide convention. C901 noqa added to three functions intentionally over the McCabe threshold; pre-existing but uncovered when the previous CI run skipped them on path filters. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Run ruff format across grounding-touched files (10 files reformatted) - Stub llm_client inside TestClient context for the two grounding-repair endpoint tests so FastAPI startup's init_llm_client doesn't clear it in CI (no ANTHROPIC_API_KEY) - Sanitize stack-trace exposure: return generic messages from the grounding-repair endpoint instead of str(exc); keep server logs intact (CodeQL CWE-209) - Validate qualified column references against the referenced table's columns rather than the union, so renamed columns in one table aren't shadowed by their presence in another - Flag markdown table.column references where the table itself is missing instead of silently skipping them (the failure this check exists to catch) - Type-check group_columns before iterating in time_series drift check, matching load_time_series_config's contract - Quote identifiers via _quote_identifier and bound the enum-values scan with LIMIT max_per_column + 1 so a free-text column on a large table can't turn the pre-flight into a full scan - Make llm_retries semantics consistent: report the number of retries performed (attempt index on success, max_retries on give-up) - Exit with an error for --static-only when the DuckDB file is missing, instead of falling through to the LLM phase - Release state_lock around the slow LLM call in tidy_grounding_repair; re-acquire only for the small post-write critical section, with a project-switch guard so a concurrent project change can't be clobbered Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- verify: reject --static-only + --skip-grounding-check as mutually exclusive (exit 2). The static branch is gated on `not skip_grounding_check`, so the combination silently fell through to the LLM phase, contradicting --static-only's documented semantics. - tidy_apply: move pre-apply schema snapshot (old_schema capture, state.pre_tidy_schema assignment, write_snapshot) inside state_lock, so a concurrent project switch can't desync the snapshot from the DB the apply mutates. - tidy_grounding_repair: wrap the prerequisite checks and input snapshot in `async with state.state_lock` to match what the pre-existing "before releasing the lock" comment claimed. The slow LLM call still runs outside the lock; the final mutation section already re-acquires it and guards on project_dir_snapshot. - Add a CLI test covering the new flag rejection. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- tidy_grounding_repair: assign state.schema_info from the freshly
introspected `tables` before rebuilding schema_map. Previously
schema_text was built from `tables` while schema_map was built from
the pre-introspection schema_info, so the two could desync if the
DB schema ever drifted from schema_info at that point. Mirrors the
assignment order in tidy_apply.
- grounding_repair: loosen _FENCED_JSON so terse outputs like
```json\n{...}``` (no newline before the closing fence) match on
the first try instead of falling through to the bare-JSON path —
which previously left trailing backticks in the candidate string
and forced a wasted retry. Whitespace on either side of the body
is now optional.
- Add regression tests covering both fenced-block edge cases.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Track per-call wall-clock time and emit token-rate in cost summary
Adds elapsed_seconds wall-clock timing on every LLM API call, aggregates
it through the agent loop, and surfaces it in the QUERY TOTAL log line
and provenance JSON as `elapsed=X.XXs out_tps=Y.Y`. Provider-agnostic —
works for Ollama (whose OpenAI-compatible endpoint omits its native
timing fields), Anthropic, OpenAI, and GitHub Models alike. The reported
rate is `output_tokens / elapsed_wallclock`, which includes prefill +
network and is therefore a lower bound on the model's true decode rate;
callers that want a clean decode rate should probe the model directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Detect and repair grounding-file drift after schema changes
Adds two complementary pieces that catch the failure mode where
queries.yaml / schema_description.md / time_series.yaml fall out of
sync with the live database (typically after a `datasight tidy review`
reshape). When that happens the LLM agent is silently being trained on
wrong column names — hallucinating, refusing, or returning tables full
of zeros despite a clean execution.
Plan 1 — `datasight verify` static drift check
- New `datasight.grounding` library: AST-walks queries.yaml with
sqlglot, scans backticked identifiers in schema_description.md,
and validates time_series.yaml entries against the live schema.
- Suppresses false positives from prose enumeration values by
auto-loading distinct values of low-cardinality VARCHAR columns.
- Wired into `datasight verify` as a cheap pre-flight check. New
`--static-only` flag skips the LLM phase for fast drift-only
checks; `--skip-grounding-check` disables it.
Plan 2 — `datasight tidy review` post-apply repair hook
- New `datasight.grounding_repair` library: snapshots schema before
the tidy transform, calls the configured LLM with both old and
new schemas plus the drift report, validates every rewritten SQL
against the live DB (retries up to 2x on failures), and returns
proposed file contents without writing.
- Hook in `tidy review` runs after `apply_proposal` completes: shows
drift, prompts to repair, displays unified diff, prompts to apply,
and writes atomically via tempfile + os.replace.
- Scope limited to queries.yaml, schema_description.md, and
time_series.yaml — schema.yaml and measures.yaml stay owned by
tidy_review's existing update helpers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Document measured per-model memory footprints for Apple Silicon
Adds an "Apple Silicon: MLX-native models" subsection to the choosing-an-llm
concepts page with measured numbers from the resource monitor:
- qwen2.5:7b (q4_K_M GGUF): ~2 GB resident — only option that fits 16 GB
- gemma4:e2b-mlx-bf16: ~11 GB — NOT low-memory despite the "e2b" naming,
because the default 256K KV-cache buffer dominates
- qwen3.6:35b-a3b-coding-mxfp8: ~38 GB — best answer quality, sparse MoE
works well on Apple Silicon's unified memory
Tier table makes the recommendation explicit: 16 GB → qwen2.5:7b, 32 GB →
qwen2.5:7b or gemma4 (tradeoff: terse vs. richer), 48 GB+ → qwen3.6:35b-A3B.
Cross-references added from install.md and configuration.md so users
landing on either page get pointed at the measured guidance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Catches the failure mode where
queries.yaml,schema_description.md, andtime_series.yamlfall out of sync with the live database — typically after adatasight tidy reviewreshape. When that happens the LLM agent is silently trained on wrong column names: it hallucinates, refuses, or returns tables full of zeros despite a clean execution. Discovered while benchmarking local Ollama models against a wide→long reshape: every model produced 0/5 useful answers until the grounding files were rewritten by hand.Two pieces, both opt-in at user-controlled checkpoints:
Plan 1 —
datasight verifystatic drift check (no LLM)datasight.groundinglibrary: AST-walks queries.yaml with sqlglot, scans backticked identifiers inschema_description.md, and validatestime_series.yamlentries against the live schema. Suggests nearest matches by edit distance.Values: `pacific`, `mountain`, ….datasight verifyas a cheap pre-flight check that runs before the LLM phase. New--static-onlyflag for fast drift-only checks;--skip-grounding-checkdisables it.Plan 2 —
datasight tidy reviewpost-apply repair hook (LLM, opt-in)datasight.grounding_repairlibrary: snapshots schema before the tidy transform, calls the configured LLM with both old and new schemas plus the drift report, validates every rewritten SQL against the live DB (retries up to 2x on failures), and returns proposed file contents without writing.tidy reviewruns afterapply_proposalcompletes: shows drift, prompts to repair, displays a unified diff, prompts to apply, and writes atomically via tempfile +os.replace.schema.yamlandmeasures.yamlremain owned bytidy_review's existing update helpers.Also included: per-call token-rate reporting
The first commit on this branch threads wall-clock timing through the agent loop and surfaces it in the
QUERY TOTALlog line aselapsed=X.XXs out_tps=Y.Y. Provider-agnostic — works for Ollama (whose OpenAI-compat endpoint strips native timing fields), Anthropic, OpenAI, and GitHub Models alike. Helped quantify the impact of the grounding fix during benchmarking.Test plan
pytest tests/test_grounding.py— 20 unit tests covering drift detection, CTE/output-alias false positives, qualified table.column resolution, enum allowlist suppression, time_series checks, parse errorspytest tests/test_grounding_repair.py— 18 tests covering JSON parsing (bare/fenced/non-object), atomic writes (validation-error files left untouched), happy-path repair, retry-on-bad-SQL, max-retries-exceeded fallbackpytest -q— full suite 1535 passed, 1 deselected (pre-existing live-LLM test needsqwen2.5:7bOllama model installed, unrelated)datasight verify --static-onlyagainst a known-drifted project flags the stale references and exits 1; against a clean project reports clean and exits 0🤖 Generated with Claude Code